#PART 1

Part 1 retrieves CES 4.0 data for the Bay Area and maps the PM2.5 and Asthma variables. The data methodology information in the CalEnviroScreen report gives the context for each variable. The Asthma variable is the age-adjusted rate of emergency department visits for asthma per 10,000 people (from 2015 to 2017). The PM2.5 variable is the annual mean PM2.5 concentration in micrograms per cubic meter (from 2015 to 2017). The plots can be seen below.

#PART 2

Next, a scatter plot was produced to compare PM2.5 (x-axis) and Asthma (y-axis). The plot can be seen below.

At this stage, the best-fit line appears to capture only part of the relationship: PM2.5 and Asthma look somewhat positively correlated, but the trend appears to bend at around 8 on the PM2.5 axis. The points are also not spread evenly above and below the line of best fit. Rather, a dense concentration of points just below the line is offset by a wide spread of points above it.

#PART 3

Now, a linear regression analysis is run using the lm() function. The results can be seen below.

## 
## Call:
## lm(formula = Asthma ~ PM2.5, data = ces4_map)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -54.349 -25.752  -9.579  12.833 183.031 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -115.55      13.01  -8.883   <2e-16 ***
## PM2.5          19.77       1.53  12.920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.35 on 1568 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.09621,    Adjusted R-squared:  0.09563 
## F-statistic: 166.9 on 1 and 1568 DF,  p-value: < 2.2e-16

The following can be said from the summary of the model:

An increase of 1 unit in PM2.5 (1 microgram per cubic meter) is associated with an increase of 19.77 in Asthma (emergency department visits per 10,000 people); 9.62% of the variation in Asthma is explained by the variation in PM2.5.
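
The two quantities interpreted above can also be pulled out of a fitted model directly rather than read off the printed summary. The following is a minimal sketch on simulated data: the `demo` data frame and its values are assumptions made for illustration and are not the CES 4.0 data, only the column names mirror the model above.

```r
# Simulated stand-in for the tract data (NOT the CES 4.0 values).
set.seed(1)
demo <- data.frame(PM2.5 = runif(200, min = 6, max = 11))
demo$Asthma <- -100 + 18 * demo$PM2.5 + rnorm(200, sd = 35)

model <- lm(Asthma ~ PM2.5, data = demo)
model_summary <- summary(model)

# Slope: estimated change in Asthma per 1-unit increase in PM2.5.
slope <- coef(model)[["PM2.5"]]

# R-squared: share of the variance in Asthma (the response) that the
# model accounts for using PM2.5 (the predictor).
r_squared <- model_summary$r.squared

# With a single predictor, R-squared equals the squared correlation
# between the two variables, which makes a quick sanity check.
r_squared_check <- cor(demo$PM2.5, demo$Asthma)^2

cat(sprintf("slope = %.2f, R-squared = %.1f%%\n", slope, 100 * r_squared))
```

Because R-squared always describes the response variable, it is read as the share of variation in Asthma explained by PM2.5, not the other way around.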

#PART 4

Now, the residual distribution is plotted. This shows how skewed the residuals are (if at all).
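
As an illustration of this kind of check, the sketch below plots a residual density and computes a simple skewness measure. It runs on simulated data (the `demo` data frame is an assumption, not the CES 4.0 data), and the moment-based skewness formula is one common choice rather than necessarily the method used for the plot in this report.

```r
# Simulated stand-in data with a right-skewed response (NOT CES 4.0 values).
set.seed(2)
demo <- data.frame(PM2.5 = runif(300, min = 6, max = 11))
demo$Asthma <- exp(0.7 + 0.35 * demo$PM2.5 + rnorm(300, sd = 0.6))

model <- lm(Asthma ~ PM2.5, data = demo)
res <- residuals(model)

# Residuals from a least-squares fit with an intercept average to zero,
# so a median well below zero points to a long upper tail.
res_mean <- mean(res)
res_median <- median(res)

# Moment-based sample skewness: positive values indicate a right skew.
skewness <- mean((res - res_mean)^3) / mean((res - res_mean)^2)^1.5

# Density plot of the residuals with a reference line at zero.
plot(density(res), main = "Residual density (simulated data)",
     xlab = "Residual")
abline(v = 0, lty = 2)

cat(sprintf("median = %.2f, skewness = %.2f\n", res_median, skewness))
```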

As seen above, the residuals are not centered: the bulk of them sits below zero, with a long tail of large positive values. When the residuals are skewed like this, the data can be better fit by a curve than by a straight line. Below is the model with a log transformation applied.

With a log transformation applied, the line of best fit appears much better centered on the data. The points are spread more evenly above and below the line. A summary of the transformed model can be seen below.

## 
## Call:
## lm(formula = log(Asthma) ~ PM2.5, data = ces4_map)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00455 -0.46320  0.03286  0.42216  1.75469 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.69911    0.22815   3.064  0.00222 ** 
## PM2.5        0.35557    0.02684  13.247  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.655 on 1568 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.1006, Adjusted R-squared:  0.1001 
## F-statistic: 175.5 on 1 and 1568 DF,  p-value: < 2.2e-16

An increase of 1 unit in PM2.5 is associated with Asthma being multiplied by exp(0.356), or about 1.43 (a roughly 43% increase); 10.06% of the variation in log(Asthma) is explained by the variation in PM2.5.
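
The multiplicative reading of the slope follows from back-transforming the log scale, and it can be verified numerically. The sketch below uses simulated data (the `demo` data frame is an assumption, not the CES 4.0 data) and also reproduces the arithmetic for the coefficient reported in the summary above.

```r
# Simulated stand-in data (NOT the CES 4.0 values).
set.seed(3)
demo <- data.frame(PM2.5 = runif(300, min = 6, max = 11))
demo$Asthma <- exp(0.7 + 0.35 * demo$PM2.5 + rnorm(300, sd = 0.6))

log_model <- lm(log(Asthma) ~ PM2.5, data = demo)
b <- coef(log_model)[["PM2.5"]]

# On the log scale the slope is additive; exponentiating turns it into a
# multiplier on Asthma per 1-unit increase in PM2.5.
multiplier <- exp(b)
percent_change <- 100 * (multiplier - 1)

# Check: back-transformed fitted values one PM2.5 unit apart differ by
# exactly that multiplier.
pred <- exp(predict(log_model, newdata = data.frame(PM2.5 = c(8, 9))))
ratio <- pred[[2]] / pred[[1]]

# Arithmetic for the slope reported in this report's summary output.
reported_multiplier <- exp(0.35557)  # about 1.43

cat(sprintf("multiplier = %.3f (%.1f%% change)\n", multiplier, percent_change))
```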

#PART 5

Lastly, the residuals of the log-transformed model can be seen below.

After applying a log transformation to Asthma (the y-axis variable), the distribution of the residuals is no longer skewed and is much more centered on zero. These residuals can be plotted for the entire Bay Area, which can be seen on the map below.

To find the tract with the most negative residual, the residuals column of the data frame is summarized below. After sorting the data frame, the most negative residual (-2.00) occurs in row 758. The location of this tract can be seen in the map below.

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
## -2.00455 -0.46320  0.03286  0.00000  0.42216  1.75469        1
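
One way to attach residuals to the data frame and locate the lowest one is sketched below. It runs on simulated data: the `demo` data frame and its `tract` labels are hypothetical, and the use of `na.exclude` is an assumption about how a row with a missing value could be handled, not necessarily the approach taken in this report.

```r
# Simulated stand-in data with hypothetical tract labels (NOT CES 4.0 values).
set.seed(4)
demo <- data.frame(
  tract = sprintf("tract_%02d", 1:50),
  PM2.5 = runif(50, min = 6, max = 11)
)
demo$Asthma <- exp(0.7 + 0.35 * demo$PM2.5 + rnorm(50, sd = 0.6))
demo$Asthma[10] <- NA  # mimic one observation dropped for missingness

# na.exclude pads the residuals with NA for the dropped row, so the
# vector lines up with the rows of the data frame.
log_model <- lm(log(Asthma) ~ PM2.5, data = demo, na.action = na.exclude)
demo$residuals <- residuals(log_model)

# Five-number summary plus a count of NA values.
print(summary(demo$residuals))

# which.min() ignores NA and returns the row of the lowest residual.
lowest_row <- which.min(demo$residuals)
print(demo[lowest_row, ])

# Equivalent approach by sorting; order() places NA rows last.
sorted <- demo[order(demo$residuals), ]
print(head(sorted, 2))  # most negative and second most negative
```

Without `na.exclude`, the residual vector would be one element shorter than the data frame and could not be assigned as a column directly.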

Based on the above analysis, the tract with the most negative residual is in Stanford, California. This tract covers the east portion of the Stanford University campus. It is important to note that the west portion of the Stanford campus has the second most negative residual. A strongly negative residual means the observation falls far below the line of best fit, more so than any other tract in the data. In other words, the observed asthma rate is much lower than the model predicts from the tract's PM2.5 level. One possible explanation is that students are constantly moving in and out of the campus, so the long-term effects of exposure are seldom observed there.